Collecting Legacy Corpora from Social Science Research for Text Mining Evaluation

Author

Bei Yu

Abstract

In this poster we describe a pilot study of searching the social science literature for legacy corpora with which to evaluate text mining algorithms. The emerging field of computational social science demands large amounts of social science data to train and evaluate computational models. We argue that legacy corpora annotated by social science researchers through traditional Qualitative Data Analysis (QDA) are ideal data sets for evaluating text mining methods such as text categorization and clustering. As a pilot study, we searched leading communication journals for articles that involve content analysis and discourse analysis, and then contacted the authors regarding the availability of the annotated texts. Regrettably, nearly all of the corpora that we found were not adequately maintained, and many were no longer available, even though they were less than ten years old. This situation calls for more effort to better maintain and use legacy social science data for future computational social science research.

Similar resources

Spatializing a Digital Text Archive about History

The amount of digital text data available in online libraries has risen dramatically in recent years. The GoogleBooks and Universal Digital Library (UDL) initiatives illustrate this impressively. The rapid evolution of vast digital text data archives has spurred the growth of an interdisciplinary Digital Humanities (DH) community, as [1] puts it, the once inaccessible has suddenly...

A System for Building FrameNet-like Corpus for the Biomedical Domain

Semantic Role Labeling (SRL) plays an important role in different text mining tasks. The development of SRL systems for the biomedical area is hindered by the lack of large-scale domain-specific corpora that are annotated with semantic roles. In our previous work, we proposed a method for building a FrameNet-like corpus for the area using domain knowledge provided by ontologies. In this paper,...

Using Statistical Properties to Enhance Text Categorization

Statistical properties extracted from text are useful in many areas. Identifying the author of a text or determining its category are among the uses of collecting such statistics. In this paper, language-independent properties of text are studied using two categorized corpora of news articles. It is observed that the properties depend neither on the corpus nor on its size. Several interesting...

Designing a System for Trend Analysis of Users in Website Surfing in Iran Using Data Mining and Text Mining Algorithms

Background and Aim: Because web surfing has become part of the lifestyle of the vast majority of people in society, and more accurate social and cultural policy making is needed in this field, the authors set out to analyze how users view different websites so as to help politicians and practitioners. Methods: The design science research method is used in this research...

A Novel Approach for Sentiment Analysis Using Classifiers Naive Bayes, SVM and Modified K-Means

Sentiments, evaluations, attitudes, and emotions are the subjects of study of sentiment analysis and opinion mining. The inception and rapid growth of the field coincide with those of the social media on the Web, e.g., reviews, forum discussions, blogs, micro blogs, Twitter, and social networks, because for the first time in human history, we have a huge volume of opinionated data recorded in d...

Publication date: 2015
